109 research outputs found

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    The (Undesired) Attenuation of Human Biases by Multilinguality

    Get PDF
    Some human preferences are universal. The odor of vanilla is perceived as pleasant all around the world. We expect neural models trained on human texts to exhibit these kinds of preferences, i.e. biases, but we show that this is not always the case. We explore 16 static and contextual embedding models in 9 languages and, when possible, compare them under similar training conditions. We introduce and release CA-WEAT, multilingual culturally aware tests to quantify biases, and compare them to previous English-centric tests. Our experiments confirm that monolingual static embeddings do exhibit human biases, but the values differ across languages and are far from universal. Biases are less evident in contextual models, to the point that the original human association may be reversed. Multilinguality proves to be another variable that attenuates and even reverses the effect of the bias, especially in contextual multilingual models. To explain this variance among models and languages, we examine the effect of asymmetries in the training corpus, departures from isomorphism in multilingual embedding spaces, and discrepancies in the testing measures between languages.
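    The CA-WEAT tests build on the WEAT family of association tests, whose core statistic is an effect size computed from cosine similarities between target and attribute word vectors. Below is a minimal sketch of that statistic, assuming a `vectors` lookup from words to embeddings; the word lists in the usage comment are the classic English flower/insect example, not the released CA-WEAT lists.

    ```python
    # Minimal WEAT-style effect size, the statistic that tests such as CA-WEAT
    # build on. `vectors` (word -> embedding) and the example word lists are
    # placeholders; the actual CA-WEAT lists are released with the paper.
    import numpy as np

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def association(w, A, B, vectors):
        # s(w, A, B): mean similarity of w to attribute set A minus to set B
        return (np.mean([cos(vectors[w], vectors[a]) for a in A])
                - np.mean([cos(vectors[w], vectors[b]) for b in B]))

    def weat_effect_size(X, Y, A, B, vectors):
        # Cohen's-d-style effect size over the two target sets X and Y
        s_X = [association(x, A, B, vectors) for x in X]
        s_Y = [association(y, A, B, vectors) for y in Y]
        return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

    # Illustrative usage: flowers vs. insects against pleasant vs. unpleasant
    # X = ["rose", "tulip", ...]; Y = ["ant", "wasp", ...]
    # A = ["love", "peace", ...]; B = ["hatred", "pain", ...]
    # d = weat_effect_size(X, Y, A, B, vectors)  # d > 0 mirrors the human bias
    ```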

    An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

    Get PDF
    End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, the internal representations learned by the network serve as a new semantic representation of words, or sentences, which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e. the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as between semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.
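    As an illustration of the extrinsic evaluation described above, here is a sketch of parallel sentence identification from encoder context vectors. It assumes per-token encoder states are already available; mean pooling and the similarity threshold are assumptions, not the paper's exact configuration.

    ```python
    # Score candidate sentence pairs by cosine similarity of their (mean-pooled)
    # NMT encoder context vectors and keep the pairs above a threshold.
    import numpy as np

    def sentence_vector(token_states: np.ndarray) -> np.ndarray:
        # token_states: (num_tokens, hidden_dim) matrix of encoder outputs
        return token_states.mean(axis=0)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def find_parallel(src_vecs, tgt_vecs, threshold=0.5):
        """Return (i, j, score) for source/target sentences judged parallel."""
        pairs = []
        for i, u in enumerate(src_vecs):
            for j, v in enumerate(tgt_vecs):
                score = cosine(u, v)
                if score >= threshold:
                    pairs.append((i, j, score))
        return sorted(pairs, key=lambda p: -p[2])
    ```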

    1949-2019: 70 anni di TA visti attraverso i dati utilizzati

    Get PDF
    Machine translation (MT) has undergone several transformations from the 1940s to today. As in many other areas of computer science and artificial intelligence, the field has moved from manually developed, ad hoc resources to approaches that rely increasingly on pre-existing data. This contribution offers an overview of the different MT architectures and of the data they require, starting from rule-based approaches and moving on to statistical, example-based and neural architectures. Each of these shifts has affected the kind of data needed to build MT engines. Whereas the earliest approaches did not require aligned sentences, with statistical MT it became essential to rely on large amounts of parallel data. Today, thanks to neural networks, good-quality translation can be obtained even for language combinations for which data are not available in both languages.

    UniBO @ AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo

    Get PDF
    We describe our participation in the EVALITA 2020 (Basile et al., 2020) shared task on Automatic Misogyny Identification. We focus on Task A, Misogyny and Aggressive Behaviour Identification, which aims at detecting whether a tweet in Italian is misogynous and, if so, whether it is aggressive. Rather than building two different models, one for misogyny and one for aggressiveness identification, we handle the problem as a single multi-class classification task with three classes: non-misogynous, non-aggressive misogynous, and aggressive misogynous. Our three-class supervised model, built on top of AlBERTo, obtains an overall F1 score of 0.7438 on the task test set (F1 = 0.8102 for the misogyny subtask and F1 = 0.6774 for the aggressiveness subtask), which outperforms the top submitted model (F1 = 0.7406).
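    A minimal sketch of the single three-class setup described above, using the Hugging Face transformers API; the checkpoint path is a hypothetical placeholder for an AlBERTo model, and the fine-tuning loop is omitted.

    ```python
    # One sequence-classification head with three labels instead of two binary
    # models. MODEL_NAME is a hypothetical placeholder for an AlBERTo checkpoint.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    LABELS = ["non-misogynous", "non-aggressive misogynous", "aggressive misogynous"]
    MODEL_NAME = "path/to/alberto-checkpoint"  # placeholder

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(LABELS))

    def predict(tweet: str) -> str:
        inputs = tokenizer(tweet, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return LABELS[int(logits.argmax(dim=-1))]

    # The two subtask decisions follow from the single prediction:
    # misogynous  = predict(t) != "non-misogynous"
    # aggressive  = predict(t) == "aggressive misogynous"
    ```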

    Wikicardi : hacia la extracción de oraciones paralelas de Wikipedia

    Get PDF
    One of the goals of the Tacardi project (TIN2012-38523-C02-00) is to extract parallel sentences from comparable corpora in order to enrich and adapt machine translation systems. In this research we use a subset of Wikipedia as a comparable corpus. This report describes our progress on the extraction of parallel fragments from Wikipedia. First, we discuss how we defined the three domains of interest (science, computer science and sport) within the encyclopedia, and how we extracted the texts and the other data needed to characterise the articles in the different languages. We then briefly discuss the models we will use to identify parallel sentences and give only a sample of preliminary results. The data obtained so far suggest that it will be possible to extract parallel sentences from the domains of interest in the short term, although we do not yet have an estimate of their volume.
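    One possible way to pair articles across language editions, as a starting point for the extraction described above, is to follow Wikipedia's interlanguage links through the MediaWiki API; this is a generic sketch and not necessarily the procedure used in the report, and the article title in the comment is only illustrative.

    ```python
    # Fetch the English counterpart of a Spanish Wikipedia article via the
    # MediaWiki API (prop=langlinks).
    import requests

    API = "https://es.wikipedia.org/w/api.php"

    def english_counterpart(spanish_title: str):
        params = {
            "action": "query",
            "titles": spanish_title,
            "prop": "langlinks",
            "lllang": "en",
            "format": "json",
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        for page in pages.values():
            for link in page.get("langlinks", []):
                return link["*"]  # English title, if an interlanguage link exists
        return None

    # english_counterpart("Aprendizaje automático")  # -> "Machine learning"
    ```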

    Findings of the NLP4IF-2019 Shared Task on Fine-Grained Propaganda Detection

    Full text link
    We present the shared task on Fine-Grained Propaganda Detection, which was organized as part of the NLP4IF workshop at EMNLP-IJCNLP 2019. There were two subtasks. FLC is a fragment-level task that asks for the identification of propagandist text fragments in a news article and for the prediction of the specific propaganda technique used in each such fragment (an 18-way classification task). SLC is a sentence-level binary classification task that asks systems to detect the sentences that contain propaganda. A total of 12 teams submitted systems for the FLC task, 25 teams did so for the SLC task, and 14 teams eventually submitted a system description paper. For both subtasks, most systems managed to beat the baseline by a sizable margin. The leaderboard and the data from the competition are available at http://propaganda.qcri.org/nlp4if-shared-task/.

    Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach

    Get PDF
    The task of answer selection in community question answering consists of identifying pertinent answers from a pool of user-generated comments related to a question. The recent SemEval-2015 campaign introduced a shared task on community question answering, providing a corpus and an evaluation scheme. In this paper we address the problem of answer selection in Arabic. Our proposed model includes a rich set of features, including lexical and semantic similarities, vector representations, and rankings. We investigate the contribution of each set of features in a supervised setting. We show that combining the features by means of a linear support vector machine achieves better performance than the competition winner (F1 of 79.25 compared to 78.55).
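    A minimal sketch of the feature-combination setup, with hypothetical feature extractors standing in for the lexical/semantic similarities, vector representations, and rankings used in the paper.

    ```python
    # Concatenate several groups of question-answer features and train a linear
    # SVM on them. The feature functions below are simplified placeholders.
    import numpy as np
    from sklearn.svm import LinearSVC

    def lexical_features(question: str, answer: str) -> np.ndarray:
        q, a = set(question.lower().split()), set(answer.lower().split())
        overlap = len(q & a) / max(len(q | a), 1)   # Jaccard word overlap
        return np.array([overlap])

    def semantic_features(question: str, answer: str) -> np.ndarray:
        return np.zeros(1)  # would hold embedding-based similarities

    def rank_features(position_in_thread: int) -> np.ndarray:
        return np.array([1.0 / (1 + position_in_thread)])

    def featurize(question, answer, position):
        return np.concatenate([lexical_features(question, answer),
                               semantic_features(question, answer),
                               rank_features(position)])

    # X = np.vstack([featurize(q, a, pos) for q, a, pos in train_examples])
    # y = np.array(labels)              # 1 = pertinent answer, 0 = not
    # clf = LinearSVC(C=1.0).fit(X, y)  # linear SVM as in the abstract
    ```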

    Proppy: A System to Unmask Propaganda in Online News

    Full text link
    We present proppy, the first publicly available real-world, real-time propaganda detection system for online news, which aims at raising awareness, thus potentially limiting the impact of propaganda and helping to fight disinformation. The system constantly monitors a number of news sources, deduplicates and clusters the news into events, and organizes the articles about an event on the basis of the likelihood that they contain propagandistic content. The system is trained on known propaganda sources using a variety of stylistic features. The evaluation results on a standard dataset show state-of-the-art performance for propaganda detection.
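    A sketch of the last stage of the pipeline described above, scoring an event's articles by the likelihood of propagandistic content; character n-gram TF-IDF plus logistic regression is used here as a stand-in for proppy's actual stylistic features.

    ```python
    # Train a propaganda scorer on articles labelled by source and use it to
    # rank the articles of an event. The feature choice is an assumption.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    scorer = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=5),
        LogisticRegression(max_iter=1000),
    )
    # scorer.fit(train_texts, train_labels)  # labels derived from known sources

    def rank_event_articles(articles):
        """Order an event's articles by predicted propaganda likelihood."""
        scores = scorer.predict_proba([a["text"] for a in articles])[:, 1]
        return sorted(zip(articles, scores), key=lambda pair: -pair[1])
    ```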